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Abstract 

Principal component analysis (PCA) is a key statistical technique for 
multivariate data analysis. For large data sets the common approach 
to PCA computation is based on the standard NIPALS-PCA algorithm, 
which unfortunately suffers from loss of orthogonality, and therefore it's 
applicability is usually limited to the estimation of the first few compo- 
nents. Here we present an algorithm based on Gram-Schmidt orthogonal- 
ization (called GS-PCA), which eliminates this shortcoming of NIPALS- 
PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel im- 
plementation of both NIPALS-PCA and GS-PCA algorithms. The nu- 
merical results show that the GPU parallel optimized versions, based on 
CUBLAS (NVIDIA) are substantially faster (up to 12 times) than the 
CPU optimized versions based on CBLAS (GNU Scientific Library). 
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1 Introduction 



Principal component analysis (PCA) is one of the most valuable results from ap- 
plied linear algebra, and probably the most popular method used for compacting 
higher dimensional data sets into lower dimensional ones for data analysis, visu- 
alization, feature extraction, or data compression [Jackson, 1991[ |Jolliffe, 2002| . 
PCA provides a statistically optimal way of dimensionality reduction by pro- 
jecting the data onto a lower-dimensional orthogonal subspace that captures 
as much of the variation of the data as possible. Unfortunately, PCA quickly 
becomes quite expensive to compute for high-dimensional data sets, where both 
the number of variables and samples is high. Therefore, there is a real need in 
many applications to accelerate the computation speed of PCA algorithms. For 
large data sets, the standard approach is to use an iterative algorithm which 
computes the components sequentially, and to avoid the global methods which 
calculate all the components simultaneously. NIPALS-PCA [Wold et al., 1987] 
is the most frequently used iterative algorithm, and often considered as the 
standard PCA algorithm. However, for large data matrices, or matrices that 
have a high degree of column coUinearity, NIPALS-PCA suffers from loss of or- 
thogonality, due to the errors accumulated in each iteration step [Kramer, 1998[ . 
Therefore, in practice it is only used to estimate the first few components. Here, 
we address both the speed and orthogonality problems, and we offer new solu- 
tions which eliminate these shortcomings of the iterative PCA algorithms. 

We formulate an iterative PCA algorithm based on the Gram-Schmidt re- 
orthogonalization, which we called GS-PCA. This algorithm is stable from the 
orthogonality point of view, and if necessary, it can be used to calculate the 
full set of principal components. The speed up issue is tackled with a parallel 
implementation for Graphics Processing Units (CPUs). Here, we present the 
GPU parallel implementation of both NIPALS-PCA and GS-PCA algorithms. 
The numerical results show that the GPU parallel optimized versions, based on 
CUBLAS (NVIDIA) [NVIDIA, 2008] , are substantially faster (up to 12 times) 
than the CPU optimized versions based on CBLAS (GNU Scientific Library) 
[Galassi et al, 2006[ . 

2 Methods 

2.1 Iterative Principal Component Analysis 

In the following description, the dataset to be analyzed is represented by the M x 
N matrix X. Each column, X*^"\ n = 0, — 1, contains all the observations 
of one attribute. Also, we assume that each column is mean centered, i.e. if 
X*^"^ are the original vectors then 

N-l 

X(") = X(") - N-^ ^ X(") (1) 

n=0 
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PCA transforms the set of input column vectors [X^°^|...|X'^^ into another 
set of column vectors [t'*^-* |...|T(-'^~^)], called principal component scores. This 
transformation has the property that most of the original data's information 
content (or most of its variance) is stored in the first few component scores. 
This allows reduction of the data to a smaller number of dimensions, with 
low information loss, simply by discarding the last component scores. Each 
component is a linear combination of the original inputs and each component is 
orthogonal. This linear transformation of the matrix X is specified by a x A'^ 
matrix P so that the matrix X is factorized as: 

X = TP^, (2) 

where P is known as the loadings matrix. 

There are several PCA algorithms in the literature, namely SVD (singular 
value decomposition) and NIPALS (nonlinear iterative partial least squares), 
which use the data matrix, and POWER and EVD (eigenvalue decomposition) 
which use the covariance of the data matrix [Jackson, 199ll|Jolliffe,"2002l . SVD 
and EVD extract all the principal components simultaneously, while NIPALS 
and POWER calculate them sequentially. Unfortunately, the traditional imple- 
mentation of PCA through SVD or EVD quickly becomes prohibitive for very 
large data sets. In this case, an approximate solution can be more efficiently 
obtained using the iterative approach based on the NIPALS algorithm. 



2.2 NIPALS-PCA Algorithm 



The NIPALS-PCA algorithm can be described as following [Wold et al, 1987 
[Kramer, 1998| . In the first step, the initial data X is copied into the residual 
matrix R. Then, in the next steps the algorithm extracts iteratively one com- 
ponent at a time (fc = 0, l...,K < N) by repeated regressions of X"^ on scores 
T^*^^ to obtain improved loads P^'^^ and of X on these P^*^^ to obtain improved 
scores T*^*^). After the convergence is achieved, this process is followed by a 
deflation of the data matrix: 

X^X-TW(PW)^. (3) 

The convergence test consists in comparing two successive estimates of the eigen- 
value A and A'. If the absolute difference |A' — A| is smaller than some small 
error e then the convergence is achieved and the algorithm proceeds to the de- 
flation step. Using the NIPALS-PCA algorithm approach, the decomposition of 
the data matrix X takes the form: 

X = T(^)Pf^f)+R, (4) 

where 'T(k) — [T^"'' [...IT*^^^^^] is the matrix formed using the first K scores, 

P(K) = [P*-°^ |...|P'^'^~^^] is the matrix of the first K loadings, and R is the 
residual matrix. The pseudo-code of the NIPALS-PCA algorithm is given below: 
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R ^ X 

for(fc = 0, do 
{ 

A = 

rp(fe) ^ J^(fc) 

for(j = 0,...,J) do 
{ 

p(fc) ^ R^T^'') 

p(fe) ^ p(fe) iipc^oip^ 
X' ^ iitc^) II 

if(|A' - A| < £) then break 
A ^ A' 

} 

} 

return T, P, R 



2.3 GS-PCA Algorithm 

A well known shortcoming of the NIPALS-PCA algorithm is the loss of orthog- 
onality [ Kramer, 1998| . Both, the computed scores T'''') and the loadings P'*^), 
are supposed to be orthogonal. However, because of the errors accumulated 
at each iteration step (which involves large matrix-vector operations) this or- 
thogonality is quickly lost and in practice one can compute accurately only the 
first few components. In order to stabilize the iterative PCA computation, from 
the orthogonality point of view, we propose an algorithm which is based on 
the Gram-Schmidt (GS) re-orthogonalization process. The classical GS algo- 
rithm (CGS) recursively constructs a set of orthonormal basis vectors for the 
subspace spanned by a given set of linearly independent normalized vectors 
[Golub et al., 1996] . It is well known that the CGS algorithm is also numeri- 
cally unstable due to rounding errors. However, the CGS can be easily stabilized 
by a small modification obtaining the so called modified Gram-Schmidt (MGS) 
algorithm [Bjorck et al, 1992] . Unfortunately, the MGS algorithm cannot be 
expressed by Level-2 BLAS functions (matrix-vector operations) and therefore 
it requires a substantial amount of global communications, when implemented 
on a parallel computer |Lingen, 2000) . In contrast, the CGS algorithm can be 
easily expressed using matrix-vector operations and therefore it is more suit- 
able for parallel implementation. Also, the numerical stability of CGS can be 
achieved by applying it iteratively [Lingen, 2000] . In the proposed GS-PCA 
algorithm the re-orthogonalization correction is applied to both the scores and 
the loadings at each iteration step. 

For the pseudo-code formulation of the GS-PCA algorithm we prefer to use 
the truncated SVD description, since foi K — N the algorithm also returns the 
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full SVD decomposition of the input matrix: 

X = V(K)A(;^)Uf^)+R (5) 

where V(x) and V(^k) are the first K left and respectively rigth eigenvectors, 
A(x) are the corresponding eigenvalues, and R is the residual. One can easily 
show that T(fc) = V(x)A(x) and Pfx) — ^fx)- "^^^ pseudo-code of the GS- 
PCA algorithm can be formulated as following: 



for(A: = 0,...,^^- 1) do 

{ 

At = 

Y(fe) ^ p_(fe) 
for(j = 0,..., J) do 
{ 

U(fe) ^ R^VC^) 
if(A; > 0) then 

A ^ Uffe)U('=) 

U(fe) -U(fc)A 

} 

U(fc) ^ u(fc) ||uW||~^ 

if{k > 0) then 
{ 

Y(fe) ^ Y(fe) _ Y^^^B 
} 

Afc-||vW|| 

YW ^ vf'^'/Afc 

if(|Afc — ^1 < e) then break 

} 

R^R-AfcV('=)(UW)^ 
} 

VA 

P 

return T, P, R (for PGA) or V, U, A (for SVD) 

One can see that in every iteration step, if fc > then both the right (loads) and 
the left (scores) eigenvectors are re-orthonormalized. This procedure stabilizes 
the algorithm but it also increases a little bit the computational effort. However, 
this effort will be compensated by the efficiency of the parallel implementation. 
The GS-PCA algorithm assures the perfect orthogonality of both the loads 
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and the scores. The errors accuraulated in GS-PCA are only due to to the 
desired precision e in the estimation of the eigenvalues A^. Also, iav K — N 
the GS-PCA algorithm returns a full SVD decomposition of the original matrix 
X, with a maximum error e for eigenvalues and perfectly orthogonal left/right 
eigenvectors. 

3 Implementation Details 

The newly developed CPUs now include fully programmable processing units 
that follow a stream programming model and support vectorized single and 
double precision floating-point operations. For example, the CUDA computing 
environment provides a standard C like language interface to the NVIDIA GPUs 
[NVIDIA, 2008| . The computation is distributed into sequential grids, which are 
organized as a set of thread blocks. The thread blocks are batches of threads that 
execute together, sharing local memories and synchronizing at specified barriers. 
An enormous number of blocks, each containing maximum 512 threads, can be 
launched in parallel in the grid. 

In our implementation of NIPALS-PCA and GS-PCA algorithms we use 
CUBLAS, a recent parallel implementation of BLAS, developed by NVIDIA on 
top of the CUDA programming environment [NVIDIA, 2008| . CUBLAS library 
provides functions for: 

• creating and destroying matrix and vector objects in GPU memory; 

• transferring data from CPU mainmemory to GPU memory; 

• executing BLAS on the GPU; 

• transferring data from GPU memory back to the CPU mainmemory. 

BLAS defines a set of low-level fundamental operations on vectors and matri- 
ces which can be used to create optimized higher-level linear algebra function- 
ality. Highly efficient implementations of BLAS exist for most current com- 
puter architectures and the specification of BLAS is widely adopted in the 
development of high quality linear algebra software, such as the GNU Scien- 
tific Library (GSL) jCalassi et al, 2006[ . We have selected GSL CBLAS, for 
our host (CPU) implementation, due to its portability on various platforms 
(Windows/Linux/OSX, Intel/AMD) and because it is free and easy to use in 
combination with GCC (GNU Compiler). The GSL library provides a low-level 
layer which corresponds directly to the C-language BLAS standard, referred 
here as CBLAS, and a higher- level interface for operations on GSL vectors and 
matrices. 

The CBLAS (GNU Scientific Library) and respectively CUBLAS (NVIDIA 
CUDA) implementations of the NIPALS-PCA and GS-PCA algorithms require 
the following Level 1, 2, 3 BLAS functions (see the CBLAS/CUBLAS program- 
ming manuals for definition details): 
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CBLAS (Level 2): gsLblas.dgemv 
CUBLAS (Level 2): cublasDgemv 

• computes in double precision the matrix- vector product and sum: 

y <— aAx + j3 or y <— aA^x + /3 (6) 

a and /? are double precision scalars, and x and y are double precision 
vectors. A is a matrix consisting of double precision elements. Matrix A 

is stored in column major format. 

CBLAS (Level 1): gsLblas_daxpy 
CUBLAS (Level 1): cublasDaxpy 

• computes the double precision sum: 

y ax + y, (7) 

multiplies double precision vector x by double precision scalar a and adds 
the result to double precision vector y. 

CBLAS (Level 1): gsl_blas_dnrm2 
CUBLAS (Level 1): cublasDnrm2 

• computes the Euclidean norm of a double precision vector x: 



M 
m—0 

CBLAS (Level 3): gsLblas.dger 
CUBLAS (Level 3): cublasDger 

• computes in double precision the matrix-matrix sum: 

A ^axy'^ + A, (9) 

where a is a double precision scalar, x is an M element double precision 
vector, y is an A'' element double precision vector, and A is an M x A'' 
matrix consisting of double precision elements. Matrix A is stored in 
column major format. 

These are the critical functions/kernels which are efficiently exploited in the 
parallel CUBLAS implementation. The other involved functions are for vec- 
tor/matrix memory allocation and vector/matrix accessing, device (GPU) ini- 
tialization, host-device data transfer and error handling. 

In the CUBLAS implementation, the data space is allocated both on host 
(CPU) mainmemory and on device (GPU) memory. After the data is initialized 
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on host it is transferred on device, where the main parallel computation occurs. 
The results are then transferred back on host memory. 

The double precision code for CBLAS and CUBLAS implementations are 
given in the Appendices 1. 2. 3 and 4. These implementations can be easily 
modified in order to meet the end user's specifications. For example the single 
precision BLAS data allocation and vector/matrix accessing functions have the 
prefix gsLvcctorJloat. gsl_matrix_float etc. (sec CBLAS/CUBLAS program- 
ming manuals for details). For convenience, we have included a timer which 
measures the time of all main operations involved by the algorithm. 

The CBLAS (GSL) versions can be compiled using the following commands: 

g++ -02 nipals_pca.cpp -Igsl -Igslcblas -Im 
g++ -02 gs_pca.cpp -Igsl -Igslcblas -Im 

An obvious requirement for the CUBLAS (NVIDIA) version is the presence 
of an NVIDIA GPU installed. The GPU must be minimum GTX260, GTX280 
or Tesla C1060, otherwise there is no support for double precision. Also, the 
NVIDIA Driver, Toolkit and SDK must be correctly instaled. In order to com- 
pile the CUBLAS (NVIDIA) version we recommend to use the following simple 
make file: 

####################################################### 

# Build script for the CUBLAS (NVIDIA) implementation # 

# of the NIPALS-PCA and GS-PCA project # 

# Add source files here 

# (comment/uncomment your choice) 

EXECUTABLE := nipals_pca # for NIPALS-PCA 

# EXECUTABLE := gs_pca # for GS-PCA 

# C/C++ source files (compiled with gcc / C++) 

# (comment/uncomment your choice) 

CFILES := nipals_pca.c # for NIPALS-PCA 

# CFILES := gs_pca.c # for GS-PCA 

# Additional libraries needed by the project 
USECUBLAS := 1 



# Rules and targets 

include . . / . . /common/common.mk 



4 Results and Conclusion 



The numerical tests have been carried out on the foUowing system: AMD Phe- 
nom 9950 CPU (2.6GHz); XFX GTX280 GPU; NVIDIA Linux 64-bit driver 
(177.67); CUDA Toolkit and SDK 2.0; Ubuntu Linux 64-bit 8.04.1, GNU Scien- 
tific Library V. 1.11; Compilers: GCC (GNU), NVCC (NVIDIA). The GPU used 
is a high end graphics card solution with 240 stream processors and 1Gb DDR3 
RAM, which supports both single and double precision and it is theoretically 
capable of 1 Tflop computational power. 

In Figure 1 we give the CPU vs GPU execution time as a function of the 
size of the randomly generated input matrix X {M G [5 x 10^, 1.5 x lO''], N = 
M/2,K = 10,e = 10"'^). The time gap between CPU and GPU increases 
very fast by increasing the size of the input matrix, and the CPU time versus 
the GPU time reaches a maximum for M — 1.5 x 10'', where the GPU is 
about 12 times faster than the CPU. The GS-PCA algorithm is only about 
5-7% slower than the standard NIPLALS-PCA algorithm, in both CPU and 
GPU implementation. These results also show that the GPU performance is 
dependent on the scale of the problem. Thus, in order to exploit efficiently the 
massive parallelism of CPUs and to effectively use the hardware capabilities, 
the problem itself needs to scale accordingly, such that thousands of threads are 
defined and used in computation. 
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Figure 1: GPU vs CPU execution time of the NIPALS-PCA and GS-PCA 
algorithms as a function of the size of the input matrix X (A/, N = M/2, K = 10, 
Red = NIPALS-PCA, Blue = GS-PCA). 

In conclusion, we have presented an iterative GS-PCA algorithm based on 
Gram-Schmidt re-orthogonalization. The GS-PCA algorithm assures the perfect 
orthogonality of both the loads and the scores, and thus totally eliminates the 
loss of orthogonality present in the standard NIPALS-PCA algorithm. Also, 
we have discussed the GPU parallel implementation of both NIPALS-PCA and 
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GS-PCA algorithms. Wc have shown that the GPU paraUcl optimized versions, 
based on CUBLAS (NVIDIA) are substantially faster (up to 12 times) than the 
CPU optimized versions based on CBLAS (GNU Scientific Library). 
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Appendix 1: nipals_pca.cpp 



// C/C++ example for the CBLAS (GNU Scientific Library) 
// implementation of the NIPALS-PCA algorithm 
// M. Andrecut (c) 2008 
// 

// Compile with: g++ -03 nipals.cpp -Igsl -Igslcblas -Im 

// includes, system 

#include <math.h> 
#include <time.h> 

// includes, GSL & CBLAS 

#include <gsl/gsl_vector .h> 
#include <gsl/gsl_matrix.h> 
#include <gsl/gsl_blas .h> 

// declarations 

int nipals_gsl (int , int, int, gsl_matrix *, 
gsl_matrix *, gsl_matrix *) ; 

int print_results (int , int, int, 

gsl_matrix *, gsl_matrix *, 
gsl_matrix *, gsl_matrix *) ; 

// main 

int main(int argc, char** argv) 
{ 

// PCA model: X = TP' + R 

// input: X, MxN matrix (data) 

// input : M = number of rows in X 

// input : N = number of columns in X 

// input : K = number of components (K<=N) 

// output: T, MxK scores matrix 
// output: P, NxN loads matrix 
// output: R, MxN residual matrix 

int M = 1000, m; 
int N = M/2, n; 
int K = 25; 
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printf("\nProblem dimensions: MxN=y,dxy,d, K=y,d\n" , M, N, K) ; 
// initialize sreind and clock 
srand (time (NULL)) ; 
clock_t start=clock() ; 
double htime; 

// initiallize some reindom test data X 

gsl_matrix *X = gsl_matrix_alloc(M, N) ; 
for(m=0; m<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

gsl_matrix_set(X, m, n, rand()/(double)RAND_MAX) ; 
} 

} 

// allocate memory for T, P, R 

gsl_matrix *T = gsl_matrix_alloc(M, K) ; 

gsl_matrix *P = gsl_matrix_alloc(N, K) ; 

gsl_matrix *R = gsl_matrix_alloc(M, N) ; 

htime = ( (double) clockO -start) /CLOCKS_PER_SEC; 

printf ( " \nTime for data allocation: %f\n", htime); 

// call the nipals_gsl() function 

start=clock() ; 

gsl_matrix_memcpy(R, X); 

nipals_gsl(M, N, K, T, P, R) ; 

htime = ( (double) clockO -start) /CLOCKS_PER_SEC; 

printf (" \n\nTime for MIPALS-PCA computation on host: %f\n", htime); 
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// the results are in T, P, R 
print_results(M, M, K, X, T, P, R) ; 
/ / memory clean up and shutdown 
gsl_matrix_free(R) ; 
gsl_matrix_free(P) ; 
gsl_matrix_f ree(T) ; 
gsl_matrix_free(X) ; 

printf ("\nPress ENTER to exit. . .\n") ; getcharO ; 

return EXIT_SUCCESS; 
> 

int nipals_gsl (int M, int N, int K, gsl_matrix *T, 
gsl_matrix *P, gsl_matrix *R) 

{ 

// PCA model: X = TP' + R 

// input: X, MxN matrix (data) 

/ / input : M = number of rows in X 

// input: N = number of columns in X 

// input: K = number of components (K<=M) 

// output: T, MxK scores matrix 
// output : P , NxK loads matrix 

// output: R, MxN residual matrix (X is initially copied in R) 

// maximum number of iterations 

int J = 10000; 

/ / max error 

double er = l.Oe-7; 

// some useful pointers 

double *a = (double*) calloc (1 , sizeof(a)); 
double *b = (double*) calloc (1, sizeof(b)); 
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int k, n, j; 

// mean center the data 

gsl_vector *U = gsl_vector_calloc(M) ; 

for(n=0; n<N; n++) 
{ 

gsl_blas_daxpy(l .0, &gsl_matrix_column(R, n). vector, U) ; 
} 

for(n=0; n<N; n++) 
{ 

gsl_blas_daxpy(-1.0/N, U, &gsl_matrix_coliiinn(R, n). vector); 
} 

for(k=0; k<K; k++) 
{ 

gsl_blas_dcopy (&gsl_matrix_column(R, k) .vector, 
&gsl_matrix_column(T, k). vector); 

*a = 0.0; 

for(j=0; j<J; j++) 
{ 

gsl_blas_dgemv(CblasTrans, 1.0, R, 

&gsl_matrix_column(T, k) .vector, 

0.0, &gsl_matrix_column(P, k). vector); 

gsl_blas_dscal(l .0/gsl_blas_dnrm2(&gsl_matrix_coluinn(P, k) .vector) , 
&gsl_matrix_column(P, k). vector); 

gsl_blas_dgemv(CblasNoTrans , 1.0, R, 

&gsl_inatrix_column(P, k). vector, 

0.0, fegsl_matrix_coli.]itiTi (T, k). vector); 

*b = gsl_blas_dnrm2 (&gsl_matrix_column(T, k). vector); 

if (fabs(*a - *b) < er*(*b)) break; 

*a = *b; 

} 

gsl_blas_dger(-l .0, &gsl_inatrix_column(T, k). vector, 
&gsl_matrix_column(P, k). vector, R) ; 

} 
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// clean up memory 
free (a) ; 
free(b) ; 

gsl_vector_f ree(U) ; 

return EXIT_SUCCESS ; 
} 

int print_results (int M, int N, int K, 

gsl_matrix *X, gsl_matrix *T, 
gsl_matrix *P, gsl_matrix *R) 

i 

int m, n; 

// If M < 13 print the results on screen 

if (M > 12) return EXIT_SUCCESS ; 

printf ("\nX\n") ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("7,+f ", gsl_matrix_get(X, m, n)); 

} 

printf ("\n") ; 

} 

printf ("\nT\n") ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<K; n++) 

{ 

printf (""/o+f ", gsl_matrix_get(T, m, n) ) ; 
} 

printf ("\n") ; 

} 

gsl_matrix *F = gsl_matrix_alloc(K, K) ; 

16 



gsl_blas_d.gemm (CblasTrans, CblasNoTrans , 1.0, T, T, 0.0, F); 

printf ("\nT' * T\n") ; 

for(m=0; in<K; ni++) 
{ 

for(n=0; n<K; n++) 
{ 

printf ("7.+f ", gsl_inatrix_get(F, m, n)); 
} 

printf ("\n") ; 

} 

gsl_matrix_free(F) ; 

printf ("\nP\n") ; 

for(m=0; in<N; m++) 
{ 

for(n=0; n<K; n++) 

{ 

printf ("°/o+f ", gsl_matrix_get(P, m, n) ) ; 
} 

printf ("\n") ; 
} 

gsl_matrix *G = gsl_matrix_alloc(K, K) ; 

gsl_blas_dgenim (CblasTrans, CblasNoTrans, 1.0, P, P, 0.0, G); 

printf ("\nP' * P\n") ; 

for(m=0; in<K; m++) 
{ 

for(n=0; n<K; n++) 

{ 

printf ("y.+f ", gsl_matrix_get(G, m, n) ) ; 
} 

printf ("\n") ; 
} 

gsl_matrix_free(G) ; 
printf ("\nR\n") ; 
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for(m=0; in<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("°/o+f ", gsl_matrix_get (R, m, n) ) ; 
} 

printf ("\n") ; 
} 

return EXIT_SUCCESS ; 
} 
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Appendix 2: nipals_pca.c 



// C/C++ example for the CUBLAS (NVIDIA) 
// implementation of NIPALS-PCA algorithm 
// 

// M. Andrecut (c) 2008 



// includes, system 

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <time.h> 



// includes, cuda 
#include <cublas.h> 



// matrix indexing convention 



#define id(m, n. Id) (((n) * (Id) + (m))) 



// declarations 



int nipals_cublas (int , int , int , 

double * , double * , 
double *) ; 

int print_results(int , int, int, 

double *, double *, 
double *, double *) ; 

// main 

int main(int argc, char** argv) 
{ 

// PCA model: X = T * P' + R 

// input: X, MxN matrix (data) 

// input : M = number of rows in X 

// input : N = number of columns in X 

// input: K = number of components (K<=M) 

// output: T, MxK scores matrix 
// output: P, NxN loads matrix 
// output: R, MxN residual matrix 
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int M = 1000, m; 
int N = M/2, n; 
int K = 25; 

printf("\nProblem dimensions: MxN=%dx7.d , K='/.d\n", M, N, K) ; 

// initialize srand aind clock 

srand (time (NULL)) ; 

clock_t start=clock() ; 

double dtime; 

// initialize cublas 

cublasStatus status; 

status = cublaslnit ; 

if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "! CUBLAS initialization error \n") ; 

return EXIT_FAILURE ; 

} 

// initiallize some random test data X 
double *X; 

X = (double* )malloc(M*N * sizeof (X [0] ) ) ; 

if(X == 0) 
{ 

f printf (stderr , "! host memory allocation error: X\n"); 

return EXIT_FAILURE ; 

} 

for(m = 0; m < M; m++) 
{ 

for(n = 0; n < M; n++) 

{ 

X[id(m, n, M)] = randO / (double) RAND_MAX; 
} 

} 
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// allocate host memory for T, P, R 
double *T; 

T = (double*)malloc(M*K * sizeof (T [0] ) ) ; ; 

if(T == 0) 
{ 

f printf (stderr , "! host memory allocation error: T\n"); 

return EXIT_FAILURE; 

} 

double *P; 

P = (double*)malloc(N*K * sizeof (P[0])) ; ; 

if(P == 0) 
{ 

f printf (stderr , "! host memory allocation error: P\n"); 

return EXIT_FAILURE ; 

} 

double *R; 

R = (double* )malloc(M*N * sizeof (R[0] )); ; 

if (R == 0) 
{ 

f printf (stderr , "! host memory allocation error: R\n"); 

return EXIT_FAILURE ; 

} 

dtime = ( (double) clockO - start) /CLOCKS _PER_SEC; 
printf (" \nTime for data allocation: 7,f\n", dtime); 
// call nipals_cublas 
start=clock() ; 

memcpy(R, X, M*N * sizeof (X [0] )) ; 

nipals_cublas(M, N, K, T, P, R) ; 

dtime = ( (double) clockO - start) /CLOCKS _PER_SEC; 
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printf ("\nTime for NIPALS-PCA computation on device: 7of\n", dtime) ; 

print .results (M, N, K, X, T, P, R) ; 

// memory clean up 

free(R) ; 

free(P) ; 

free(T) ; 

free(X) ; 

// shutdown 

status = cublasShutdownO ; 

if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr, "! cublas shutdown error\n"); 

return EXIT_FAILURE; 

} 

if(argc <= 1 I I strcmp(argv[l] , " -noprompt " ) ) 
{ 

printf ("\nPress ENTER to exit...\n"); 

get char () ; 

} 

return EXIT_SUCCESS; 
} 



int nipals_cublas(int M, int N, int K, 
double *T, double *P, 
double *R) 

i 

II PCA model: X = T * P' + R 

// input: X, MxN matrix (data) 

/ / input : M = number of rows in X 

/ / input : N = number of columns in X (N<=M) 

// input: K = number of components (K<M) 
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// output: T, MxK scores matrix 
// output: P, NxK loads matrix 
// output: R, MxN residual matrix 

// CUBLAS error handling 

cublasStatus status; 

// meiximum number of iterations 

int J = 10000; 

// max error 

double er = l.Oe-7; 

int k, n, j; 

// treinsfer the host matrix X to device matrix dR 
double *dR = 0; 

status = cublasAlloc(M*N, sizeof (dR[0] ) , (void**)&dR) ; 

if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, "! device memory allocation error (dR)\n"); 

return EXIT_FAILURE ; 

} 

status = cublasSetMatrix(M, N, sizeof (R[0] ) , R, M, dR, M) ; 

if (status != CUBLAS_STATUS_SUCCESS) 

{ 

fprintf (stderr , "! device access error (write dR)\n"); 

return EXIT_FAILURE ; 

} 

// allocate device memory for T, P 
double *dT = 0; 

status = cublasAlloc(M*K, sizeof (dT [0] ) , (void**)&dT) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
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f printf (stderr , "! device memory allocation error (dT)\n 

return EXIT_FAILURE; 

} 

double *dP = 0; 

status = cublasAlloc(N*K, sizeof (dP [0] ) , (void**)&dP) ; 

if (status != CUBLAS_STATUS_SUCCESS) 

{ 

f printf (stderr , "! device memory allocation error (dP)\n 

return EXIT_FAILURE; 

} 

// mean center the data 
double *dU = 0; 

status = cublasAlloc(M, sizeof (dU[0] ) , (void**)&dU) ; 

if (status != CUBLAS_STATUS_SUCCESS) 

{ 

f printf (stderr , "! device memory allocation error (dU)\n 

return EXIT_FAILURE; 

} 

cublasDcopy(M, &dR[0] , 1, dU, 1); 

for(n=l; n<N; n++) 
{ 

cublasDaxpy(M, 1.0, &dR[n*M] , 1, dU, 1); 
} 

for(n=0; n<N; n++) 
{ 

cublasDaxpy(M, -1.0/N, dU, 1, &dR[n*M] , 1); 
} 

double a, b; 

for(k=0; k<K; k++) 
{ 

cublasDcopy(M, &dR[k*M] , 1, &dT[k*M], 1); 
a = 0.0; 
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for(j=0; j<J; j++) 
{ 

cublasDgemvCt' , M, M, 1.0, dR, M, &dT[k*M] , 1, 0.0, &dP[k*M], 1) ; 

cublasDscaKN, 1 . 0/cublasDnrin2(N, &dP [k*N] , 1), &dP [k*N] , 1); 

cublasDgemvCn' , M, M, 1.0, dR, M, &dP[k*N], 1, 0.0, &dT[k*M], 1) ; 

b = cublasDnrin2(M, &dT[k*M], 1); 

if (fabs(a - b) < er*b) break; 

a = b; 
} 

cublasDgerCM, N, -1.0, &dT[k*M], 1, &dP[k*N], 1, dR, M) ; 
} 

// transfer device dT to host T 

cublasGetMatrix(M, K, sizeof (dT [0] ) , dT, M, T, M) ; 
// transfer device dP to host P 

cublasGetMatrix(N, K, sizeof (dP [0] ) , dP, M, P, N) ; 
// transfer device dR to host R 

cublasGetMatrix(M, N, sizeof (dR[0] ) , dR, M, R, M) ; 
// clean up memory- 
status = cublasFree(dP) ; 
status = cublasFree(dT) ; 
status = cublasFree (dR) ; 

return EXIT_SUCCESS ; 
} 



int print _results(int M, int N, int K, 

double *X, double *T, 
double *P, double *R) 

{ 

int m, n, k; 
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// If M < 13 print the results on screen 

if (M > 12) return EXIT_SUCCESS ; 

printf ("\nX\n") ; 

for(m=0; in<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("7.+f ", X[id( m, n,M)]); 

} 

printf ("\n") ; 

} 

printf ("\nT\n") ; 

for(m=0; in<M; m++) 
{ 

for(n=0; n<K; n++) 

{ 

printf ("y,+f ", T[id(in, n, M)]); 
} 

printf ("\n") ; 
} 

double a; 

printf ("\nT' * T\n") ; 

for(m = 0; in<K; in++) 
{ 

for(n=0; n<K; n++) 
{ 

a=0; 

for(k=0; k<M; k++) 
i 

a = a + T[id(k, m, M)] * T[id(k, n, M)] ; 

} 

printf ("y.+f ", a); 
} 

printf ("\n") ; 
} 

printf ("\nP\n") ; 
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for(m=0; in<N; m++) 
{ 

for(n=0; n<K; n++) 
{ 

printf("7.+f ", P[id(in, n, N)]); 

} 

printf ("\n") ; 
} 

printf ("\nP' * P\ii") ; 

for(m = 0; m<K; 111++) 
{ 

for(n=0; n<K; n++) 

{ 

a=0; 

for(k=0; k<M; k++) 
{ 

a = a + P[id(k, m, N)] * P[id(k, 
} 

printf ("7.+f ", a); 
} 

printf ("\n") ; 
} 

printf ("\nR\n") ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("7.+f ", R[id( m, n,M)]); 

} 

printf ("\n") ; 
} 

return EXIT_SUCCESS ; 
> 
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Appendix 3: gs_pca.cpp 



// C/C++ example for the CBLAS (GNU Scientific Library) 

// implementation of PCA-GS algorithm 

// 

// M. Andrecut (c) 2008 

// includes, system 

#include <math.h> 

#include <time.h> 

// includes, GSL & CBLAS 

#include <gsl/gsl_vector .h> 

#include <gsl/gsl_matrix.h> 

#include <gsl/gsl_blas .h> 

// declarations 

int gs_pca_gsl(int, int, int, 

gsl_matrix *, gsl_matrix *, 
gsl_matrix *) ; 

int print_results (int , int, int, 

gsl_matrix *, gsl_matrix *, 
gsl_matrix *, gsl_matrix *) ; 

// main 

int main(int argc, char** argv) 
i 

II PCA model: X = TP' + R 

// input: X, MxN matrix (data) 

/ / input : M = number of rows in X 

// input : N = number of columns in X 

// input: K = number of components (K<=N) 

// output: T, MxK scores matrix 
// output: P, NxK loads matrix 
// output: R, MxN residual matrix 
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int M = 1000, m; 
int N = M/2, n; 
int K = 10; 

printfC'XnProblem dimensions: MxN=%dxy.d , K=%d" , M, N, 
// initialize sreind and clock 
srand (time (NULL)) ; 
clock_t start=clock() ; 
double htime; 

// initiallize some random test data X 

gsl_matrix *X = gsl_matrix_alloc (M, N) ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

gsl_matrix_set(X, m, n, randO / (double)RAND_MAX) ; 
} 

} 

// allocate memory for T, P, R 

gsl_matrix *T = gsl_matrix_alloc (M, K) ; 

gsl_matrix *P = gsl_matrix_alloc (N, K) ; 

gsl_matrix *R = gsl_matrix_alloc (M, N) ; 

htime = ( (double) clockO -start) /CLOCKS_PER_SEC; 

printf ("\nTime for data allocation: %f\n", htime); 

// call gs_pca_gsl 

start=clock() ; 

gsl_matrix_memcpy (R, X) ; 

gs_pca_gsl(M, N, K, T, P, R) ; 
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htime = ( (double) clockO -start) /CLOCKS_PER_SEC; 

printf ("\n\nTime for GS-PCA host computation: 7,f\n", htime); 

// the results are in T, P, R 

print_results(M, N, K, X, T, P, R) ; 

// memory clean up and shutdown 

gsl_matrix_f ree(R) ; 

gsl_matrix_free(P) ; 

gsl_matrix_free(T) ; 

gsl_matrix_f ree(X) ; 

printf ("\nPress ENTER to exit. . .\n") ; getcharO ; 

return EXIT_SUCCESS ; 
> 

int gs_pca_gsl(int M, int N, int K, 

gsl_matrix *T, gsl_matrix *P, 
gsl_matrix *R) 

{ 

// PCA model: X = TLP' + R 

// input: X, MxN matrix (data, copied in R) 

// input : M = number of rows in X 

// input: N = number of columns in X 

// input: K = number of components (K<=M) 

// output: T, MxK left eigenvectors 

// output: P, NxK right eigenvectors 

// output: L, Kxl eigenvalues 

// output: R, MxN residual 

// maximum number of iterations 

int J = 10000; 

// max error 
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double er = l.Oe-7; 
double a; 

int n, k, j; 

// mean center the data 

gsl_vector *U = gsl_vector_calloc (M) ; 

for(n=0; n<N; n++) 
{ 

gsl_blas_daxpy(l .0, &gsl_matrix_column(R, n). vector, U) ; 
} 

for(n=0; n<N; n++) 
{ 

gsl_blas_daxpy(-1.0/N, U, &gsl_matrix_column(R, n). vector); 
} 

// allocate memory fo eigenvalues 

gsl_vector *L = gsl_vector_alloc(K) ; 

// gs_pca 

for(k=0; k<K; k++) 
{ 

gsl_blas_dcopy (&gsl_matrix_column(R, k) .vector, 
&gsl_matrix_column(T, k). vector); 

a = 0.0; 

for(j=0; j<J; j++) 

{ 

gsl_blas_dgemv(CblasTrans , 1.0, R, 

&gsl_matrix_column(T, k). vector, 

0.0, &gsl_matrix_column(P, k). vector); 

if (k>0) 
{ 

gsl_blas_dgemv(CblasTrEins, 1.0, 

&gsl_matrix_submatrix (P, 0, 0, M, k) .matrix, 
&gsl_matrix_column(P , k). vector, 0.0, 
&gsl_vector_subvector (U, 0, k). vector); 

gsl_blas_dgemv(CblasNoTrans , -1.0, 
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&gsl_matrix_submatrix (P, 0, 0, N, k) .matrix, 
&gsl_vector_subvector (U, 0, k). vector, 1.0, 
&gsl_inatrix_coli]itiTi (P, k). vector); 

} 

gsl_blas_dscal (1.0/ gsl_blas_dnrin2 (fegsl_matrix_coli.iitiTi (P , k) .vector) , 
&gsl_matrix_coliiinn(P, k). vector); 

gsl_blas_dgemv(CblasMoTrans, 1.0, R, &gsl_inatrix_column(P, k). vector, 
0.0, fegsl_matrix_coli]itiTi (T, k). vector); 

if (k>0) 
{ 

gsl_blas_dgemv(CblasTrEins, 1.0, 

&gsl_inatrix_subinatrix (T, 0, 0, M, k) .matrix, 
&gsl_matrix_column(T, k). vector, 0.0, 
&gsl_vector_subvector (U, 0, k). vector); 

gsl_blas_dgemv(CblasNoTrajis , -1.0, 

&gsl_matrix_submatrix (T, 0, 0, M, k) .matrix, 
&gsl_vector_subvector (U, 0, k). vector, 1.0, 
&gsl_matrix_coliimn(T, k). vector); 

} 

gsl_vector_set(L, k, gsl_blas_dnrm2(&gsl_matrix_coliimn(T, k). vector)); 

gsl_blas_dscal (1.0/ gsl_vector_get (L , k) , 

&gsl_matrix_coliimn(T, k). vector); 

if (fabs(a - gsl_vector_get(L, k)) < er*gsl_vector_get(L, k)) break; 

a = gsl_vector_get(L, k) ; 
} 

gsl_blas_dger (-gsl_vector_get(L, k) , 

&gsl_matrix_columii(T, k). vector, 
&gsl_matrix_coliimn(P, k). vector, R) ; 

} 

for(k=0; k<K; k++) 

{ 

gsl_blas_dscal (gsl_vector_get(L, k) , 

&gsl_matrix_coliimn(T, k). vector); 

} 

// memory clean up 
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gsl_vector_free(L) ; 

gsl_vector_free(U) ; 

return EXIT_SUCCESS; 
> 

int print _results(int M, int N, int K, 

gsl_matrix *X, gsl_matrix *T, 
gsl_matrix *P, gsl_matrix *R) 

{ 

int m, n; 

/* If M < 13 print the results on screen */ 

if (M > 12) return EXIT_SUCCESS; 

printf ("\nX\n") ; 

for(m=0; in<M; ni++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("7.+f ", gsl_inatrix_get(X, m, n)); 
} 

printf ("\n") ; 
} 

printf ("\nT\n") ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<K; n++) 
{ 

printf ("7.+f ", gsl_inatrix_get(T, m, n)); 
} 

printf ("\n") ; 
} 

gsl_matrix *F = gsl_inatrix_alloc (K, K) ; 

gsl_blas_dgenmi(CblasTrans, CblasMoTrans , 1.0, T, T, 0.0, F) ; 
printf ("\nT' * T\n") ; 
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for(m=0; in<K; m++) 
{ 

for(n=0; n<K; n++) 
{ 

printf ("°/o+f ", gsl_matrix_get (F, m, n) ) ; 
} 

printf ("\n") ; 
} 

gsl_matrix_free(F) ; 

printf ("\nP\n") ; 

for(m=0; in<N; m++) 
{ 

for(n=0; n<K; n++) 
{ 

printf ("7,+f ", gsl_inatrix_get(P, m, n)); 
} 

printf ("\n") ; 
} 

gsl_matrix *G = gsl_matrix_alloc (K, K) ; 

gsl_blas_dgenmi (CblasTrans, CblasNoTrans , 1.0, P, P, 0.0, G); 

printf ("\nP' * P\n") ; 

for(m=0; in<K; m++) 
{ 

for(n=0; n<K; n++) 
{ 

printf ("7,+f ", gsl_inatrix_get(G, m, n)); 
} 

printf ("\n") ; 
} 

gsl_matrix_f ree(G) ; 
printf ("\nR\n") ; 

for(m=0; in<M; ni++) 

■[ 

for(n=0; n<N; n++) 
{ 

printf ("7.+f ", gsl_inatrix_get(R, m, n)); 



} 

printf ("\n") ; 
} 

return EXIT_SUCCESS; 
> 



Appendix 4: gs_pca.c 



// C/C++ example for the CUBLAS (NVIDIA) 
// implementation of PCA-GS algorithm 
// 

// M. Andrecut (c) 2008 

// includes, system 

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <time.h> 

// includes, cuda 

#include <cublas.h> 

// matrix indexing convention 

#define id(m, n. Id) (((n) * (Id) + (m))) 

// declarations 

int gs_pca_cublas(int, int, int, double *, double *, double *) ; 

int print_results(int, int, int, double *, double *, double *, double *) ; 

// main 

int main(int argc, char** argv) 
{ 

// PCA model: X = TP' + R 

// input: X, MxN matrix (data) 

// input : M = number of rows in X 

/ / input : N = number of columns in X 

// input : K = number of components (K<=N) 

// output: T, MxK scores matrix 
// output: P, NxK loads matrix 
// output: R, MxN residual matrix 

int M = 1000, m; 
int N = M/2, n; 
int K = 10; 



36 



printf("\nProblem dimensions: MxN=7.dx7.d , K=7.d", M, M, K) ; 

// initialize sreind and clock 

srand (time (NULL)); 

clock_t start=clock() ; 

double dtime; 

// initialize cublas 

cublasStatus status; 

status = cublasInitO; 

if (status != CUBLAS_STATUS_SUCCESS) 
{ 

f printf (stderr , "! CUBLAS initialization error \n") ; 

return EXIT_FAILURE; 

} 

// initiallize some raindom test data X 
double *X; 

X = (double*)malloc(M*N * sizeof (X [0] ) ) ; 

if(X == 0) 
{ 

fprintf (stderr, "! host memory allocation error: X\n"); 

return EXIT_FAILURE; 

} 

for(m =0; m < M; m++) 
{ 

for(n = 0; n < M; n++) 
{ 

X[id(m, n, M)] = randO / (double) RAND_MAX; 
} 

} 

// allocate host memory for T, P, R 
double *T; 
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T = (double*)inalloc(M*K * sizeof (T [0] ) ) ; ; 
if (T == 0) 
{ 

f printf (stderr , "! host memory allocation error: T\n"); 

return EXIT_FAILURE; 

} 

double *P; 

P = (double*)malloc(M*K * sizeof (P[0])) ; ; 

if(P == 0) 
{ 

f printf (stderr , "! host memory allocation error: P\n"); 

return EXIT_FAILURE ; 

} 

double *R; 

R = (double* )malloc(M*N * sizeof (R[0] )); ; 

if (R == 0) 
{ 

f printf (stderr , "! host memory allocation error: R\n"); 

return EXIT_FAILURE ; 

} 

dtime = ( (double) clockO -start) /CLOCKS_PER_SEC; 
printf (" \nTime for data allocation: %f\n", dtime); 
// call gs_pca_cublas 
start=clock() ; 

memcpy(R, X, M*N * sizeof (X [0] )) ; 

gs_pca_cublas(M, N, K, T, P, R) ; 

dtime = ( (double) clockO -start) /CLOCKS_PER_SEC; 

printf (" \nTime for device GS-PCA computation: °/.f\n", dtime); 

// the results are in T, P, R 

print_results(M, M, K, X, T, P, R) ; 
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// clean up memory 
free(P) ; 
free(T) ; 
free(X) ; 
// shutdown 

status = cublasShutdownO ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, "! cublas shutdown error\n"); 

return EXIT_FAILURE ; 

} 

if(argc <= 1 I I strcmp(argv[l] , " -noprompt " ) ) 
{ 

printf ("\nPress ENTER to exit...\n"); getcharO; 
} 

return EXIT_SUCCESS ; 
> 



int gs_pca_cublas (int M, int N, int K, 
double *T, double *P, 
double *R) 

{ 

// PCA model: X = TP' + R 

// input: X, MxN matrix (data) 

// input : M = number of rows in X 

// input: N = number of columns in X 

// input: K = number of components (K<=M) 

// output: T, MxK scores matrix 
/ / output : P , NxK loads matrix 
// output: R, MxN residual matrix 

cublasStatus status; 

// maximum number of iterations 



39 



int J = 10000; 
// maix error 
double er = l.Oe-7; 
int n, j, k; 

// treinsfer the host matrix X to device matrix dR 
double *dR = 0; 

status = cublasAlloc(M*N, sizeof (dR[0] ) , (void**)&dR) ; 

if (status != CUBLAS_STATUS_SUCCESS) 

{ 

fprintf (stderr, "! device memory allocation error (dR)\n" 

return EXIT_FAILURE ; 

} 

status = cublasSetMatrix(M, N, sizeof (R[0] ) , R, M, dR, M) ; 

if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, "! device access error (write dR)\n"); 

return EXIT_FAILURE; 

} 

// allocate device memory for T, P 
double *dT = 0; 

status = cublasAlloc(M*K, sizeof (dT[0] ) , (void**)&dT) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, "! device memory allocation error (dT)\n" 

return EXIT_FAILURE ; 

} 

double *dP = 0; 

status = cublasAlloc(N*K, sizeof (dP [0] ) , (void**)&dP) ; 

if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, "! device memory allocation error (dP)\n" 
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return EXIT_FAILURE; 
} 

// allocate memory for eigenvalues 
double *L; 

L = (double* )malloc(K * sizeof (L [0] ) ) ; ; 

if(L == 0) 
{ 

fprintf (stderr, "! host memory allocation error: T\n"); 

return EXIT_FAILURE; 

} 

// mean center the data 
double *dU = 0; 

status = cublasAlloc(M, sizeof (dU[0] ) , (void**)&dU) ; 
if (status != CUBLAS_STATUS_SUCCESS) 
{ 

fprintf (stderr, "! device memory allocation error (dU)\n"); 

return EXIT_FAILURE; 

} 

cublasDcopy(M, &dR[0] , 1, dU, 1); 

for(n=l; n<N; n++) 
{ 

cublasDaxpy (M, 1.0, &dR[n*M] , 1, dU, 1); 
} 

for(n=0; n<N; n++) 
{ 

cublasDaxpy (M, -1.0/N, dU, 1, &dR[n*M] , 1); 
} 

// GS-PCA 

double a; 

for(k=0; k<K; k++) 
{ 

cublasDcopy (M, &dR[k*M] , 1, &dT[k*M], 1); 
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a = 0.0; 

for(j=0; j<J; j++) 
{ 

cublasDgemv ('t', M, N, 1.0, dR, M, &dT[k*M], 1, 0.0, &dP [k*N] , 1) ; 

if (k>0) 
{ 

cublasDgemv ('t', M, k, 1.0, dP, M, &dP[k*N], 1, 0.0, dU, 1); 

cublasDgemv ('n', N, k, -1.0, dP, N, dU, 1, 1.0, &dP [k*N] , 1); 
} 

cublasDscal (N, 1.0/cublasDnrm2(N, &dP[k*N], 1), &dP[k*M], 1); 

cublasDgemv ('n', M, N, 1.0, dR, M, &dP[k*N], 1, 0.0, &dT[k*M], 1); 

if (k>0) 
{ 

cublasDgemv ('t', M, k, 1.0, dT, M, &dT[k*M], 1, 0.0, dU, 1); 

cublasDgemv ('n', M, k, -1.0, dT, M, dU, 1, 1.0, &dT[k*M], 1); 
} 

L[k] = cublasDnrm2(M, &dT[k*M], 1); 
cublasDscal (M, 1 . 0/L [k] , &dT[k*M], 1); 
if(fabs(a - L[k]) < er*L[k]) break; 
a = L [k] ; 
} 

cublasDger (M, N, - L[k], &dT[k*M], 1, &dP[k*M], 1, dR, M) ; 
} 

for(k=0; k<K; k++) 

-[ 

cublasDscal (M, L [k] , &dT[k*M], 1); 
} 

// transfer device dT to host T 

cublasGetMatrix (M, K, sizeof (dT [0] ) , dT, M, T, M) ; 
// transfer device dP to host P 
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cublasGetMatrix (M, K, sizeof (dP [0] ) , dP, N, P, N) ; 
// transfer device dR to host R 

cublasGetMatrix (M, N, sizeof (dR[0] ) , dR, M, R, M) ; 
// clean up memory- 
free (L) ; 

status = cublasFree(dP) ; 

status = cublasFree(dT) ; 

status = cublasFree (dR) ; 

return EXIT_SUCCESS; 
> 



int print_results(int M, int N, int K, 

double *X, double *T, double *P, 
double *R) 

{ 

int m, n, k; 

// If M < 13 print the results on screen 

if (M > 12) return EXIT_SUCCESS ; 

printf ("\nX\n") ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("7.+f ", X[id( m, n,M)]); 
} 

printf ("\n") ; 
} 

printf ("\nT\n") ; 

for(m=0; m<M; m++) 
{ 

for(n=0; n<K; n++) 
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i 

priiitf("%+f ", T[id(in, n, M)]); 
} 

printf ("\n") ; 
} 

double a; 

printf ("\nT' * T\n") ; 

for(m = 0; in<K; 111++) 
{ 

for(n=0; n<K; n++) 
{ 

a=0; 

for(k=0; k<M; k++) 
i 

a = a + T[id(k, m, M)] * T[id(k, n, M)] ; 

} 

printf ("y,+f ", a); 
} 

printf ("\n") ; 
} 

printf ("\nP\n") ; 

for(m=0; ni<N; m++) 
{ 

for(n=0; n<K; n++) 
{ 

printf("%+f ", P[id(in, n, N)]); 
} 

printf ("\n") ; 
} 

printf ("\nP' * P\n") ; 

for(m = 0; in<K; m++) 
{ 

for(n=0; n<K; n++) 
{ 

a=0; for(k=0; k<N; k++) a = a + P[id(k, m, M)] * P[id(k, n, N)] ; 

printf ("7.+f ", a); 

} 

printf ("\n") ; 
} 
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printf ("\nR\n") ; 

for(m=0; in<M; m++) 
{ 

for(n=0; n<N; n++) 
{ 

printf ("y.+f ", R[id( m, n,M)]); 
} 

printf ("\n") ; 
} 

return EXIT_SUCCESS ; 
} 
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